Harmonic-Net: Fundamental Frequency and Speech Rate Controllable Fast Neural Vocoder

Abstract

There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of fundamental frequency ($f_{\mathrm{o}}$) and speech rate (SR). For this purpose, we propose Harmonic-Net and Harmonic-Net+, which introduce two extended functions into the HiFi-GAN generator. The first extension is a downsampling network, named the excitation signal network, that hierarchically receives multi-channel excitation signals corresponding to $f_{\mathrm{o}}$. The second extension is the layerwise pitch-dependent dilated convolutional network (LW-PDCNN), which can flexibly change its receptive fields depending on the input $f_{\mathrm{o}}$ to handle large fluctuations in $f_{\mathrm{o}}$ for the upsampling-based HiFi-GAN generator. The proposed explicit input of excitation signals and the LW-PDCNNs corresponding to $f_{\mathrm{o}}$ are expected to realize high-quality synthesis for the normal and $f_{\mathrm{o}}$-conversion conditions and for the SR-conversion condition. The results of experiments for unseen speaker synthesis, full-band singing voice synthesis, and text-to-speech synthesis show that the proposed method with harmonic waves corresponding to $f_{\mathrm{o}}$ can achieve higher synthesis quality than conventional methods in all (i.e., normal, $f_{\mathrm{o}}$-conversion, and SR-conversion) conditions.
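The abstract mentions multi-channel excitation signals built from harmonic waves that follow the fundamental frequency. As a rough illustration only, the sketch below generates one sine channel per harmonic from a per-sample frequency contour by accumulating phase. The function name `harmonic_excitation`, its parameters, and the rule of zeroing unvoiced samples and harmonics at or above the Nyquist frequency are assumptions made for this example; the abstract does not specify how the paper constructs its excitation signals.

```python
import math


def harmonic_excitation(f0, sample_rate, num_harmonics):
    """Return num_harmonics channels of sine waves at integer multiples of f0.

    f0 is a per-sample contour in Hz; a value of 0 marks an unvoiced sample.
    This is an illustrative assumption, not the paper's exact procedure.
    """
    channels = []
    for k in range(1, num_harmonics + 1):
        phase = 0.0
        channel = []
        for hz in f0:
            # Accumulate phase so that changes in f0 stay phase-continuous.
            phase += 2.0 * math.pi * k * hz / sample_rate
            # Output silence for unvoiced samples and for harmonics
            # at or above the Nyquist frequency.
            voiced = hz > 0.0 and k * hz < sample_rate / 2.0
            channel.append(math.sin(phase) if voiced else 0.0)
        channels.append(channel)
    return channels


# Hypothetical input: 10 ms of a constant 200 Hz contour at 16 kHz.
excitation = harmonic_excitation([200.0] * 160, 16000, 4)
```

Scaling the contour before calling the function (for example, multiplying every value by 2) would shift all channels together, which is one simple way to picture what controlling the fundamental frequency through the excitation input could look like.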

Similar resources

The Effects of Speech Rate, Prosodic Features, and Blurred Speech on Iranian EFL Learners' Listening Comprehension

Keywords in English: effect of speech rate on listening comprehension, blurred speech, segmental and suprasegmental features, authentic speech, intelligibility, discrimination, omission, assimilation. Abstract: The speed of listening material in connected speech has, in general, always been a nightmare for second-language learners, and especially for Iranian listeners. Despite the common-sense view that speech at a slower rate, listening comprehension activities...

Fundamental Frequency Estimation for Noisy Speech Using Entropy-Weighted Periodic and Harmonic Features

SUMMARY This paper proposes a robust method for estimating the fundamental frequency (F0) in real environments. It is assumed that the spectral structure of real environmental noise varies momentarily and its energy does not distribute evenly in the time-frequency domain. Therefore, segmenting a spectrogram of speech mixed with environmental noise into narrow time-frequency regions will produc...

Intelligibility of frequency-lowered speech produced by a channel vocoder.

Frequency lowering is a form of signal processing designed to match speech to the residual auditory capacity of a listener with a high frequency hearing loss. A vocoder-based frequency-lowering system similar to one studied by Lippmann was evaluated in the present study. In this system, speech levels in high frequency bands modulated one-third-octave bands of noise at low frequencies, which wer...

Fast Neural Net Simulation with

This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC, is operational and runs the back-propagation algorithm at a speed of 247 million connection updates per second (continuous weight update) using 32-bit floating-point precision. This is equal to 1 Gflops sustained performance. The complete sys...

Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2023

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2023.3275032